Take Home Assignment - Origin Financial

Overview

Introduction

Engineering Raw Data

Before starting the data analysis, let’s do a quickly review at available data sets and raise needs of pre-engineering.

Customers

## Rows: 803
## Columns: 7
## $ id            <chr> "131d8363-e2a6-4c60-acef-da03e6bfbdc4", "831216da-744b-4…
## $ created_at    <chr> "2020-03-11 19:38:35", "2020-05-01 13:09:00", "2020-03-1…
## $ date_of_birth <chr> "1990-08-16", "1933-11-22", "1986-02-02", "1990-01-01", …
## $ gender        <chr> "male", "", "", "male", "", "", "", "male", "male", "mal…
## $ country       <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "U…
## $ state         <chr> "CA", "WA", "NY", "CA", "", "CA", "", "CA", "NY", "CA", …
## $ city          <chr> "San Francisco", "Tonasket", "New York", "San Francisco"…

In a quick overview, it’s possibly to note that variable created_at and date_of_birth should be modified to timestamp and date. Also, we can note that all missing data are represented by empty space and it will be replaced by NA.

The variables state and city has some strange characters like this: <img src='#' onerror=alert('xss') /> this strange behavior probably came from a css operator and it will be replaced by NA.

Transactions

## Rows: 11,059
## Columns: 17
## $ X_id                                     <chr> "4D2119A1-03D9-48F2-99B7-FEDD…
## $ user_id                                  <chr> "94102846-0B6D-45D8-AFD1-DA80…
## $ account_id                               <chr> "88F8D694-AFF0-4CFC-BBB6-C89D…
## $ account_name                             <chr> "Chase - Plaid Money Market (…
## $ description                              <chr> "ACH Electronic CreditGUSTO P…
## $ type                                     <chr> "expense", "expense", "expens…
## $ amount                                   <dbl> -5850, -5850, -5850, -5850, -…
## $ date                                     <chr> "2021-06-08T00:00:00.000Z", "…
## $ extra_fields.category.0                  <chr> "Transfer", "Transfer", "Tran…
## $ extra_fields.category.1                  <chr> "Debit", "Debit", "Debit", "D…
## $ extra_fields.category.2                  <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.category_id                 <int> 21006000, 21006000, 21006000,…
## $ extra_fields.merchant_name               <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.name                        <chr> "ACH Electronic CreditGUSTO P…
## $ extra_fields.payment_channel             <chr> "other", "other", "other", "o…
## $ extra_fields.payment_meta.payment_method <chr> "ACH", "ACH", "ACH", "ACH", "…
## $ created_at                               <chr> "2021-06-18T19:14:32.075Z", "…

Exploratory Data Analysis

## Warning: Couldn't find skimmers for class: POSIXlt, POSIXt; No user-defined
## `sfl` provided. Falling back to `character`.
Data summary
Name db_users
Number of rows 803
Number of columns 8
_______________________
Column type frequency:
character 6
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1.00 36 36 0 803 0
created_at 0 1.00 2414 5633 0 803 0
gender 751 0.06 4 10 0 4 0
country 0 1.00 2 2 0 1 0
state 179 0.78 2 2 0 27 0
city 176 0.78 4 21 0 120 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_of_birth 173 0.78 1000-01-01 2001-01-01 1989-12-12 119

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age_at 173 0.78 33.56 40.21 19.41 29.63 31.08 33.99 1020.33 ▇▁▁▁▁
  • falar sobre o outlier na data
  • falar sobre o número de missing no gender

Univariate Analysis

Age

State

## Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
## Need '+proj=longlat +datum=WGS84'
  • talk about: more than half of the sample is from CA or has no available state

Users Transactions

  • talk that we have just 84 customer with transactions
## Storing counts in `nn`, as `n` already present in input
## ℹ Use `name = "new_name"` to pick a new name.
Number of Accounts Number of Customers % of Customers
1 70 83.33
2 7 8.33
3 1 1.19
4 2 2.38
5 2 2.38
6 1 1.19
7 1 1.19
  • falar que quase 20% dos customers com transacoes tem mais de uma conta, porem para as analises vou assumir apenas o id do customer

Type of transacions

  • falar que tem uma mistura de gastos e salários com transferências e que isso deveria ser visto apartado
  • começar a falar da ideia do rfm

Transactions Category

extra_fields_category_0 extra_fields_category_1 extra_fields_category_2 extra_fields_category_id n perc
NA NA NA NA 2742 100
Travel Taxi NA 22016000 1230 100
Food and Drink Restaurants Fast Food 13005032 1057 100
Travel Airlines and Aviation Services NA 22001000 1055 100
Food and Drink Restaurants NA 13005000 1053 100
Payment Credit Card NA 16001000 623 100
Food and Drink Restaurants Coffee Shop 13005043 614 100
Transfer Credit NA 21005000 607 100
Payment NA NA 16000000 443 100
Recreation Gyms and Fitness Centers NA 17018000 443 100
Shops Sporting Goods NA 19046000 443 100
Transfer Debit NA 21006000 425 100
Transfer Deposit NA 21007000 324 100
  • Falar um pouco sobre as categorias e que sera explorado um pouco mais na descritiva com a quantidade de transacoes e amount

Merchant Name

  • Falar que temos os dados apenas para expense

Channel Transactions

  • Falar que temos os dados apenas para expense

Payment Methods

  • just 4% of transactions with this data, it’s almost useless

RFM Transformation

  • Analisando o problema e vendo as possibilidades de cluster, optei pela transdormação dos dados trnsacionais para o modo recency, frequencia e monetary value.

A transformação foi feita pensando em duas abordagens… RFM dos tipos de transacao (expense, income, transfer) e RFM das categorias de expense.

Depois da transformação dos dados transacionais, é hora de juntar com os dados dos usuarios.

Ao final teremos um dataset com 84 customers e 36 variaveis e temos algumas ressalvas.

## Rows: 84
## Columns: 36
## $ id                            <chr> "AD7226ED-2D26-45FD-AB37-C3823CD5DB0C", …
## $ state                         <chr> "NY", "CA", "IL", "NY", NA, "VA", "NY", …
## $ age_at                        <dbl> 34.10458, 32.52759, 33.34352, 29.65223, …
## $ recency_days_expense          <dbl> 24, 30, 30, 31, 22, 0, 0, 0, 0, 0, 0, 0,…
## $ transaction_count_expense     <dbl> 97, 61, 73, 197, 196, 0, 0, 0, 0, 0, 0, …
## $ amount_expense                <dbl> 2825.37, 15654.90, 18785.88, 205553.34, …
## $ amount_mean_expense           <dbl> 29.12753, 256.63770, 257.34082, 1043.417…
## $ recency_days_income           <dbl> 39, 30, 30, 31, 39, 16, 16, 16, 16, 16, …
## $ transaction_count_income      <dbl> 24, 91, 109, 26, 24, 3, 3, 3, 3, 3, 3, 3…
## $ amount_income                 <dbl> 12000.00, 585798.60, 702958.32, 25000.00…
## $ amount_mean_income            <dbl> 500.0000000, 6437.3472527, 6449.1588991,…
## $ recency_days_transfer         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ transaction_count_transfer    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ amount_transfer               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ amount_mean_transfer          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ n_transactions_transfer       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ n_transactions_food_and_drink <dbl> 72, 0, 0, 246, 122, 0, 0, 0, 0, 0, 0, 0,…
## $ n_transactions_payment        <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_transactions_recreation     <dbl> 0, 0, 0, 50, 24, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_transactions_shops          <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ n_transactions_travel         <dbl> 73, 0, 0, 196, 98, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_transfercat            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ amount_food_and_drink         <dbl> 2537.52, 0.00, 0.00, 53286.50, 27537.52,…
## $ amount_payment                <dbl> 0.0, 0.0, 0.0, 99768.0, 51962.5, 0.0, 0.…
## $ amount_recreation             <dbl> 0, 0, 0, 3925, 1884, 0, 0, 0, 0, 0, 0, 0…
## $ amount_shops                  <dbl> 0, 0, 0, 24000, 12500, 0, 0, 0, 0, 0, 0,…
## $ amount_travel                 <dbl> 12287.85, 0.00, 0.00, 49573.84, 24787.85…
## $ amount_mean_transfercat       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ amount_mean_food_and_drink    <dbl> 35.24333, 0.00000, 0.00000, 216.61179, 2…
## $ amount_mean_payment           <dbl> 0.0, 0.0, 0.0, 2078.5, 2078.5, 0.0, 0.0,…
## $ amount_mean_recreation        <dbl> 0.0, 0.0, 0.0, 78.5, 78.5, 0.0, 0.0, 0.0…
## $ amount_mean_shops             <dbl> 0, 0, 0, 500, 500, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_mean_travel            <dbl> 168.3267, 0.0000, 0.0000, 252.9278, 252.…
## $ no_channel                    <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ other                         <dbl> 0.1655172, 0.0000000, 0.0000000, 0.16666…
## $ in_store                      <dbl> 0.8344828, 0.0000000, 0.0000000, 0.83333…

Expense vs Income

Aqui podemos notar que tem customer com alta renda e pouco gasto, sem renda e sem gasto, renda media e gasto medio e gasto alto e renda media. isso já é uma pista pro nosso cluster.

também temos um outlier nos gastos

Expense Vs Income Vs Age

## Warning: Removed 6 rows containing missing values (geom_point).

A idade nao parece influenciar nos tipos de transacoes na base, quanto a salario e despesas.

Expense vs Income vs Transfer

Os valores transferidos nao parecem ter relacao com os gastos e com a renda, alem de terem poucas transacoes, nao usaremos no cluster.

Relationship between types of expenses

## `summarise()` has grouped output by 'user_id', 'date'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'user_id'. You can override using the `.groups` argument.

Clustering

Notes

Conclusion